Hidden Correlation Discovery: Towards the Automation of an Analysis System
Author
Abstract
The IT analytics company Sumerian Ltd has undertaken a 2-year project to automate and embed statistical techniques into its processes. Its current automated processes rely on correlation to highlight relationships between metric pairs. It is proposed that the nature of the data often obscures instances of high correlation, and that these may be revealed through the use of clustering analysis. Clustering analysis is a technique for dividing datasets into groups so that similar items are together and those displaying differences are separated. Like many data analysis methodologies, such techniques are often better suited to theoretical situations and are ill-equipped to perform well against 'messy', massive-scale real-world data. This project therefore aims to adapt simple clustering techniques to improve on the processes at Sumerian, by investigating different algorithms aimed at uncovering hidden correlation patterns within its datasets. Using the well-known k-means clustering algorithm as a starting point and baseline, we attempt to find: highly correlated subsets of any metric pair without any categorical information; correlation in certain 'windows' of time within the data; and differences in the patterns of correlation between peak and other processing periods. Investigations into these developments take in Hill Climbing and Genetic Algorithms. Overall, no single solution was found to fully meet the company's needs, but several avenues of future research have been uncovered, and recommendations are made on how to continue with the work.

Acknowledgements

Huge thanks must go to my supervisor, Professor David Corne, for his support and guidance throughout both this and the wider KTP project, which would not have been remotely successful without his participation and encouragement. I would also like to thank Sumerian for allowing me to use my work with them as the basis for this dissertation, and particularly Chris Playford and George Theologou for rescuing me from total isolation during the work!

I, Sarah Little, confirm that this work submitted for assessment is my own and is expressed in my words. Any uses made within it of the words of other authors in any form (e.g. ideas, equations, figures, text, tables, programs) are properly acknowledged at any point of their use. A list of the references employed is included.
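The abstract's central idea, that a weak overall correlation can hide strongly correlated subsets which clustering may reveal, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the data are synthetic, the function names `pearson` and `kmeans` are hypothetical, and the starting centroids are chosen by hand so that the run is deterministic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # Return 0.0 when either variable is constant (correlation undefined).
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def kmeans(points, centroids, iterations=20):
    """Basic k-means on 2-D points; returns one cluster label per point."""
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels

# Synthetic metric pair (assumption): two regimes with opposite trends.
group_a = [(x, x) for x in range(10)]           # metric B rises with metric A
group_b = [(x, 29 - x) for x in range(20, 30)]  # metric B falls as metric A rises
points = group_a + group_b

overall_r = pearson([p[0] for p in points], [p[1] for p in points])

# Hand-picked starting centroids (assumption) for a repeatable run.
labels = kmeans(points, [points[0], points[-1]])

cluster_r = {}
for c in sorted(set(labels)):
    members = [p for p, lab in zip(points, labels) if lab == c]
    cluster_r[c] = pearson([p[0] for p in members], [p[1] for p in members])

print("overall r:", round(overall_r, 3))
for c, r in cluster_r.items():
    print("cluster", c, "r:", round(r, 3))
```

In this toy case the two regimes cancel each other in the pooled data, while each cluster on its own shows a strong linear relationship. Real metric data would be noisier, and k-means groups points by distance rather than by correlation, which is one reason the abstract treats it only as a baseline.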
Similar resources
Comparative Reliability Analysis of Substation Automation Architecture Based on IEC 61850 Standard
Using the IEC 61850 standard would increase the reliability and availability of the electricity network and have a large impact on network automation. Even though much research has been done on substation system reliability, there are few works on the reliability of automated substation control systems. This paper evaluates the reliability of substation automation systems based on IEC 61850 comparatively cons...
In-silico Metabolome Target Analysis Towards PanC-based Antimycobacterial Agent Discovery
Mycobacterium tuberculosis, the main cause of tuberculosis (TB), remains a global health crisis, especially in developing countries. Tuberculosis treatment is a laborious and lengthy process with a high risk of non-compliance, cytotoxic adverse events and drug resistance in patients. Recently, there has been an alarming rise in drug resistance in TB. In this regard, it is an unmet need...
Alert correlation and prediction using data mining and HMM
Intrusion Detection Systems (IDSs) are security tools widely used in computer networks. While they seem to be promising technologies, they pose some serious drawbacks: when utilized in large and high-traffic networks, IDSs generate high volumes of low-level alerts which are hardly manageable. Accordingly, there emerged a recent track of security research, focused on alert correlation, which ext...
Intrusion Detection Using Evolutionary Hidden Markov Model
Intrusion detection systems are responsible for diagnosing and detecting any unauthorized use, exploitation or destruction of the system, and can prevent cyber-attacks by analysing network packets. One of the major challenges in the use of these tools is the lack of training patterns of attacks for the analysis engine; engine failure that caused the complete training, ...
Knowledge discovery from patients’ behavior via clustering-classification algorithms based on weighted eRFM and CLV model: An empirical study in public health care services
The rapid growth of information technology (IT) motivates and creates competitive advantages in the health care industry. Nowadays, many hospitals try to build successful customer relationship management (CRM) to recognize target and potential patients, increase patient loyalty and satisfaction, and finally maximize their profitability. Many hospitals have large data warehouses containing customer ...